Bank marketing campaign analysis

Jose Caloca - 110558

07/05/2021

Introduction.

In recent years, machine learning has become increasingly important in the business world, as the intelligent use of data analytics is key to business success. For this project we will be using the Bank Marketing Dataset from a Portuguese bank, originally uploaded to UCI's Machine Learning Repository. It contains the results of a marketing campaign in which clients were contacted and offered term deposits from the financial institution; our task is to analyse those contacts and identify strategies for improving future campaigns. A term deposit is a deposit held at a bank or financial institution at a fixed rate (often better than a simple deposit account), with the money returned at a specified maturity date.

The aim of this project is to predict whether a client will subscribe to a term deposit (variable y, yes/no), to determine the factors behind a successful marketing campaign, and to get a grasp of the features that influence the probability of subscribing to a term deposit.

For this project we will be using R and Python side by side in RStudio, within a single R Markdown document.

Data description.

Load R packages and Python Modules

We will be using the following R packages and Python modules, loaded as follows:

- R Packages:

# load R libraries
if (!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if (!require(DataExplorer)) install.packages("DataExplorer", repos = "http://cran.us.r-project.org")
if (!require(htmltools)) install.packages("htmltools", repos = "http://cran.us.r-project.org")
if (!require(ggstatsplot)) install.packages("ggstatsplot", repos = "http://cran.us.r-project.org")
if (!require(plotly)) install.packages("plotly", repos = "http://cran.us.r-project.org")
if (!require(rmdformats)) install.packages("rmdformats", repos = "http://cran.us.r-project.org")

- Python Modules:

#load Python modules
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sweetviz as sv

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_predict, cross_val_score

from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support, roc_curve, roc_auc_score, accuracy_score, recall_score, precision_score

from sklearn import metrics

Load dataset

The dataset used for this project can be found by clicking here. More specifically, we will load the file bank-additional-full.csv.

To load the file we call the read_csv function from pandas. Note that there must be a folder called dataset in your main project folder containing the aforementioned file; alternatively, it can be downloaded directly from the GitHub repository.

#Import dataset
dataset = pd.read_csv("dataset/bank-additional-full.csv", sep = ";")

Before starting our analysis, we must recode the output variable to a binary class (1 and 0) instead of the “yes” and “no” strings.

dataset['y'] = dataset['y'].apply(lambda x: 0 if x =='no' else 1)
dataset.rename(columns = {"y" : "deposit"}, inplace = True)
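A quick way to sanity-check the recode is to inspect the resulting class balance. The sketch below runs the same two steps on a hypothetical toy frame (`toy` is illustrative; with the real data one would call this on `dataset`):

```python
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({"y": ["no", "yes", "no", "no", "yes"]})

# Same recode and rename as above
toy["y"] = toy["y"].apply(lambda x: 0 if x == "no" else 1)
toy.rename(columns={"y": "deposit"}, inplace=True)

# Class balance after recoding: only 0s and 1s should remain
print(toy["deposit"].value_counts().to_dict())  # {0: 3, 1: 2}
```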

The following chart shows a big picture of the dataset:

This dataset has 100% complete rows: there are no missing values and no missing columns, so no imputation techniques are needed for any of the variables. Almost half of the columns are numeric. Overall this is not a heavy dataset, as it occupies only 6.6 MB of memory.
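The completeness and memory claims can be verified directly in pandas. This is a minimal sketch on a hypothetical toy frame (with the real data, the same calls would be made on `dataset`):

```python
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({"age": [30, 41], "job": ["admin.", "retired"]})

complete_rows = toy.notna().all(axis=1).mean()    # share of fully complete rows
missing_cells = int(toy.isna().sum().sum())       # total missing values
mem_mb = toy.memory_usage(deep=True).sum() / 1e6  # memory footprint in MB

print(complete_rows, missing_cells)  # 1.0 0
```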

The dataset has 21 columns and 41188 rows. The variables have the following attributes:

\(~\)

Bank client data:

\(~\)

1. - age (numeric)

2. - job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)

3. - marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)

4. - education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)

5. - default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)

6. - housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

7. - loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

\(~\)
Related with the last contact of the current campaign:

\(~\)

8. - contact: contact communication type (categorical: ‘cellular’, ‘telephone’)

9. - month: last contact month of year (categorical: ‘jan’, ‘feb’, …, ‘dec’)

10. - day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)

11. - duration: last contact duration, in seconds (numeric)

\(~\)

Other attributes:

\(~\)

12. - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13. - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14. - previous: number of contacts performed before this campaign and for this client (numeric)

15. - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

\(~\)

Social and economic context attributes

\(~\)

16. - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17. - cons.price.idx: consumer price index - monthly indicator (numeric)

18. - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19. - euribor3m: euribor 3 month rate - daily indicator (numeric)

20. - nr.employed: number of employees - quarterly indicator (numeric)

\(~\)

Output variable (desired target):

\(~\)

21. - deposit - has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’, recoded earlier to 1 and 0)

dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  deposit         41188 non-null  int64  
dtypes: float64(5), int64(6), object(10)
memory usage: 6.6+ MB

From the above description we can see that for the variable pdays (number of days that passed after the client was last contacted in a previous campaign), customers who were not previously contacted have a value of 999. We will therefore recode these values to zero (0).

dataset['pdays'] = dataset['pdays'].apply(lambda x: 0 if x ==999 else x)
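An equivalent, vectorized way to express this recode is pandas' Series.replace, which avoids the per-element lambda. A sketch on a hypothetical toy series (with the real data this would be `dataset['pdays']`):

```python
import pandas as pd

# Toy series standing in for dataset['pdays']
pdays = pd.Series([999, 3, 999, 6])

# Replace the 999 sentinel ("not previously contacted") with 0
recoded = pdays.replace(999, 0)
print(recoded.tolist())  # [0, 3, 0, 6]
```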

Exploratory Data Analysis

The exploratory data analysis (EDA), or descriptive statistics, is a preliminary and essential step for understanding the data we are going to work with, and is highly recommended for a sound research methodology.

The objective of this analysis is to explore, describe, summarize and visualize the nature of the data collected for the random variables of interest, through simple data-summary techniques and graphical methods, without imposing assumptions on their interpretation.

For the EDA graphs we will be using the Python library sweetviz. Sweetviz generates nice-looking, high-density visualizations to kickstart EDA with just two lines of code, producing a fully self-contained HTML application. The output is saved as an HTML file in the project folder and loaded into the R Markdown document by calling the includeHTML function from the htmltools package in R.

First, we look at the main summary statistics of our dataset to get a picture of the distribution of each variable.

      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
               education        default         housing           loan      
 university.degree  :12168   no     :32588   no     :18622   no     :33950  
 high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
      contact          month       day_of_week    duration     
 cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
 telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
                   aug    : 6178   thu:8623    Median : 180.0  
    campaign          pdays            previous            poutcome    
 Min.   : 1.000   Min.   : 0.0000   Min.   :0.000   failure    : 4252  
 1st Qu.: 1.000   1st Qu.: 0.0000   1st Qu.:0.000   nonexistent:35563  
 Median : 2.000   Median : 0.0000   Median :0.000   success    : 1373  
  emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
  nr.employed   deposit  
 Min.   :4964   0:36548  
 1st Qu.:5099   1: 4640  
 Median :5191            
 [ reached getOption("max.print") -- omitted 4 rows ]
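The table above comes from R's summary(); a comparable overview can be produced in Python with DataFrame.describe. A sketch on a hypothetical toy frame (with the real data, run `dataset.describe(include="all")`):

```python
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "job": ["admin.", "technician", "admin.", "retired"],
})

# include="all" covers both numeric and categorical columns
summary = toy.describe(include="all")
print(summary.loc["count", "age"])  # 4.0
print(summary.loc["top", "job"])    # admin.
```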

From the above table we can see that most of the individuals have admin or technician job positions. Most of the clients (more than half) are married, and more than 50% of the customers have at least completed high school.

32588 of the customers in the campaign have not defaulted on previous financial services. A bit more than half of the customers own their house, and far more of them have their own phone. Phone ownership may not be relevant nowadays, but years ago this was an important issue.

On average, employees change jobs at a rate of around 8% per year in a deflationary context (average CPI of 93.58).

Regarding the visualisation part of our EDA, we first create the report and export it to the project folder.

#EDA using sweetviz
dataset_eda = sv.analyze(dataset)

#Saving results to HTML file
dataset_eda.show_html('Exploratory_Data_Analysis.html')

Second, we load the HTML file into the R Markdown notebook interface. In the following sections we go feature by feature to see the range of values each one takes and how customers are distributed among them.

[Sweetviz report: interactive HTML output, summarized here. Key points recoverable from the report:

- Dataset overview: 41,188 rows, 12 duplicate rows, 21 features (13 categorical, 8 numerical), roughly 29.6 MB in RAM.

- age: median 38, mean 40, range 17 to 98, right-skewed (skew 0.785).

- duration: median 180 s, mean 258 s, max 4,918 s, heavily right-skewed (skew 3.26).

- campaign: median 2 contacts, max 56, right-skewed (skew 4.76).

- pdays: 96% zeroes after recoding (clients never contacted before), max 27.

- deposit: highly imbalanced target, with 36,548 (89%) “no” vs 4,640 (11%) “yes”.

- Strongest associations: emp.var.rate with euribor3m (1.00), nr.employed (0.99), cons.price.idx (0.88) and cons.conf.idx (0.83); poutcome with previous (0.85) and pdays (0.74); deposit is most strongly associated with duration (0.41) and nr.employed (0.35).]
EDA analysis:

From the target variable we can see that 88.73% of the customers have not subscribed to the financial product offered, so we have an imbalanced dataset. This imbalance will be reflected in the train, validation and test sets when modelling, and we will address it with either undersampling or oversampling techniques.

From the correlation plot we can observe important correlations between several features and the target variable "deposit," as well as among the features themselves. The correlation matrix above was plotted with all variables. The campaign outcome has a strong correlation with "duration," a moderate correlation with "previous contacts," and mild correlations with "balance," "month of contact" and "number of campaign contacts."

grouped_gghistostats(
    data = dataset,
    x = age,
    grouping.var = deposit, # grouping variable
    normal.curve = TRUE, # superimpose a normal distribution curve
    normal.curve.args = list(color = "red", size = 1),
    ggtheme = ggthemes::theme_tufte(),
    plotgrid.args = list(nrow = 1),
    ggstatsplot.layer = FALSE,
    ggplot.component = list(theme(text = element_text(size = 6.3))),
    annotation.args = list(title = "Age distribution by deposit")
)

In the age variable, we observe little difference between customers who subscribed to a deposit and those who did not; the average in both groups is around 40 years. Statistically, however, the two groups do differ, given the low p-value of the t-test. The only remarkable difference is that a larger share of older clients subscribed to a deposit.

ggbarstats(data = dataset, x = education, y = deposit, title = "Education by deposit subscription", 
    legend.title = "Educational level", ggtheme = hrbrthemes::theme_ipsum_pub())

Education shows differences between levels. For example, clients with a university degree convert at 13.72%, while those with basic levels of education do not reach 9% in some cases. This suggests targeting the product at clients with university, professional or high-school education.

Regarding job type, retirees, students, the unemployed and those in management positions yield the best results for this financial product.

Regarding marital status, single clients appear somewhat more receptive to the term deposit offer.

The month variable is a good indicator: the number of contacts and their efficiency vary strongly from month to month. For example, March achieved 50% efficiency with very few contacts (only 500), whereas in May around 14,000 contacts were made with an efficiency of only 6.4%.

Regarding the variable pdays we can say that most of the clients were contacted for the first time.

ggbarstats(data = dataset, x = poutcome, y = deposit, title = "Outcome of the previous marketing campaign by current deposit subscription", 
    legend.title = "Previous campaign outcome", ggtheme = hrbrthemes::theme_ipsum_pub())

Of the customers who subscribed to a deposit, only 19% had a successful outcome in the previous campaign.

Looking ahead to the data pre-processing, there is no need to impute data since we have no missing values. As for outliers, we see a few in the "age" variable and will keep them, since there are no age restrictions on subscribing to a term deposit. If this case study were credit-risk related, we would have to discard or transform these outliers.

Data manipulation

One-Hot Encoding of categorical variables

Most of our categorical features contain label values rather than numbers, with the number of possible values limited to a fixed set. The problem when modeling with categorical data is that many algorithms cannot work with it directly, so a preliminary transformation of these variables is required before modeling.

For example, a decision tree can be trained directly from categorical data with no data transform required (this depends on the specific implementation).

Apart from such exceptions, many machine learning algorithms cannot operate on label data directly: they require all input and output variables to be numeric. In general, this is a constraint of efficient implementations rather than a hard limitation of the algorithms themselves.

The main idea is to split a column containing categorical data into as many columns as there are categories. Each new column contains a "1" for the rows that belong to its category and a "0" otherwise.
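As a small illustration on a made-up toy column (not the bank dataset), pandas' get_dummies performs exactly this split:

```python
import pandas as pd

# hypothetical toy column, for illustration only
toy = pd.DataFrame({"contact": ["cellular", "telephone", "cellular"]})

# one-hot encode: one 0/1 column per category
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
#    contact_cellular  contact_telephone
# 0                 1                  0
# 1                 0                  1
# 2                 1                  0
```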

First: we create two data sets for numeric and non-numeric data

numerical = dataset.select_dtypes(exclude=['object'])
categorical = dataset.select_dtypes(include=['object'])

Second: One-hot encode the non-numeric columns

onehot = pd.get_dummies(categorical)

Third: Union the one-hot encoded columns to the numeric ones

df = pd.concat([numerical, onehot], axis=1)

Fourth: Print the columns in the new data set

glimpse(py$df)
Rows: 41,188
Columns: 64
$ age                           <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, ~
$ duration                      <dbl> 261, 149, 226, 151, 307, 198, 139, 217, ~
$ campaign                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ pdays                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ previous                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ emp.var.rate                  <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, ~
$ cons.price.idx                <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, ~
$ cons.conf.idx                 <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4~
$ euribor3m                     <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857~
$ nr.employed                   <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191~
$ deposit                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_admin.                    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
$ `job_blue-collar`             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0~
$ job_entrepreneur              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_housemaid                 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
$ job_management                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_retired                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ `job_self-employed`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_services                  <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0~
$ job_student                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_technician                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ job_unemployed                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ job_unknown                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ marital_divorced              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
$ marital_married               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0~
$ marital_single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0~
$ marital_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ education_basic.4y            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
$ education_basic.6y            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ education_basic.9y            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
$ education_high.school         <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0~
$ education_illiterate          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ education_professional.course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0~
$ education_university.degree   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ education_unknown             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0~
$ default_no                    <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1~
$ default_unknown               <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0~
$ default_yes                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ housing_no                    <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0~
$ housing_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ housing_yes                   <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1~
$ loan_no                       <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1~
$ loan_unknown                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ loan_yes                      <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0~
$ contact_cellular              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ contact_telephone             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ month_apr                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_aug                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_dec                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_jul                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_jun                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_mar                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_may                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ month_nov                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_oct                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ month_sep                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ day_of_week_fri               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ day_of_week_mon               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day_of_week_thu               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ day_of_week_tue               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ day_of_week_wed               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ poutcome_failure              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ poutcome_nonexistent          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ poutcome_success              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

With this method we end up with a larger dataframe of 64 columns.

df.shape
(41188, 64)

Creation of Training, Validation and Test datasets

In any machine learning project, a common practice after the EDA is to split the dataset into training, test and validation sets (where applicable). We set a seed (123) for sampling reproducibility and split the one-hot encoded dataset into training, validation and test sets using pandas and numpy. The training set contains 70% of the data, the validation set 15% and the test set the remaining 15%.

# We create the X and y data sets
X = df.loc[ : , df.columns != 'deposit']
y = df[['deposit']]

# Create training, evaluation and test sets
X_train, test_X, y_train, test_y = train_test_split(X, y, test_size=.3, random_state=123)
X_eval, X_test, y_eval, y_test = train_test_split(test_X, test_y, test_size=.5, random_state=123)

To check how imbalanced our training dataset is with respect to the target variable "deposit," we run the following code, which calculates the percentage of customers in the training set who did not subscribe to a term deposit.

# percentage of deposits and non-deposits in the training set
round(y_train['deposit'].value_counts()*100/len(y_train['deposit']), 2)
0    88.75
1    11.25
Name: deposit, dtype: float64

We find that 88.75% of the customers did not subscribe to a term deposit, while 11.25% took this financial product. For modeling purposes our dataset should not be imbalanced, as this would bias our estimates: many algorithms assume a balanced or nearly balanced dataset. In the next section we apply a technique to balance the training set.

Balancing dataset

Imbalanced data refers to classification problems where the classes are not represented equally. Most classification datasets do not have exactly the same number of instances in each class, and a small difference often does not matter; our dataset, however, is strongly imbalanced.

Our imbalanced dataset is not adequate for predictive modeling: as mentioned above, most classification algorithms were designed around the assumption of an equal number of examples per class. Violating this assumption yields models with poor predictive performance, especially for the minority class. This matters because the minority class is typically the more important one, so the problem is more sensitive to classification errors on the minority class than on the majority class.

To balance the dataset we use undersampling, which consists of sampling from the majority class and keeping only part of its observations. This reduces the number of rows in the dataset; we can afford this because our training set is quite large.

First we create data sets for deposits and no-deposits:

X_y_train = pd.concat([X_train.reset_index(drop = True), y_train.reset_index(drop = True)], axis = 1)
count_no_deposit, count_deposit = X_y_train['deposit'].value_counts()
no_deposit = X_y_train[X_y_train['deposit'] == 0]
deposit = X_y_train[X_y_train['deposit'] == 1]

Second we undersample the no-deposits

no_deposit_under = no_deposit.sample(count_deposit)

Third, we concatenate the undersampled no-deposits with the deposits

train_balanced = pd.concat([no_deposit_under.reset_index(drop = True), deposit.reset_index(drop = True)], axis = 0)

Lastly, we check the proportion of deposits and no-deposits in our target variable:

round(train_balanced['deposit'].value_counts()*100/len(train_balanced['deposit']), 2)
1    50.0
0    50.0
Name: deposit, dtype: float64

We get a balanced training dataset with 50% of customers who subscribed to a term deposit and 50% who did not. This undersampled, balanced dataset now has 6,488 rows.

From our balanced train dataset we set our X_train feature matrix that contains all independent variables by running the following code:

X_train = train_balanced.loc[ : , train_balanced.columns != 'deposit']
y_train = train_balanced[['deposit']]

Model Building and Evaluation

In this section we will use supervised learning algorithms in order to predict and estimate an output based on one or more inputs. In our case, we want to predict whether a customer will subscribe to a term deposit based on some input data described before.

Logistic Regression Model

A logistic regression model predicts the probability of the positive class. In our case, it will predict the probability of a customer subscribing to a term deposit.

We start by training the logistic regression model on the training data.

clf_logistic = LogisticRegression(max_iter = 100000).fit(X_train, np.ravel(y_train))

Based on the trained model, we predict the probability that a customer subscribes to a term deposit, using the validation data.

preds = clf_logistic.predict_proba(X_eval)

The predict_proba function used in the previous chunk returns probabilities in the range (0, 1). The first column is the probability that a customer does not take a term deposit, and the second column is the probability of subscribing to one. Now we create a dataframe with the predicted probabilities of subscribing alongside the true values of people who subscribed:

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.036086
1        0             0.039713
2        0             0.794001
3        0             0.054322
4        0             0.012686
5        0             0.355502
6        0             0.025778
7        0             0.021069
8        0             0.069364
9        1             0.165333

We are interested in the classification report for this model. For this, we assign classes based on the threshold 0.5, the midpoint between 0 and 1, which is a common default in many algorithms. In other words, any estimated probability above 0.5 is classified as a deposit (1), and otherwise as a no-deposit (0).

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > 0.5 else 0)

We can roughly compare the estimates with the actual values through the difference in deposit counts.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4760
1    1418
Name: prob_accept_deposit, dtype: int64

Count of actual deposits in our validation set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

Choosing the right metric is crucial when evaluating machine learning (ML) models; various metrics have been proposed for different applications.

By its nature, this case study is a classification problem. We therefore choose Recall (aka Sensitivity, TPR or True Positive Rate) as our model-performance metric; it is defined as the fraction of samples from a class that the model predicts correctly.

The Recall metric answers the question "Of all of the positive samples, what proportion did I predict correctly?" It focuses on false negatives (FN), the observations our algorithm missed: the lower the number of FNs, the better the predictive power of the model. Since our target variable has two classes (whether a customer subscribes to a term deposit or not), we analyse this metric for both classes.

\(Recall(Deposit) = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)

\(Recall(No\text{-}Deposit) = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}\)

Another important metric, analysed here but not used for model selection, is Accuracy: perhaps the simplest metric one can imagine, defined as the number of correct predictions divided by the total number of predictions.

\(Accuracy = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}}\)
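These definitions can be checked on a toy example (hypothetical labels, not taken from our dataset) with scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, recall_score

# hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall(Deposit): TP / (TP + FN), with 1 as the positive class
recall_deposit = recall_score(y_true, y_pred, pos_label=1)     # 3/4 = 0.75
# Recall(No-Deposit): the same formula, treating 0 as the positive class
recall_no_deposit = recall_score(y_true, y_pred, pos_label=0)  # 3/4 = 0.75
# Accuracy: correct predictions over all predictions
accuracy = accuracy_score(y_true, y_pred)                      # 6/8 = 0.75
```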

To assess the performance of our model, we examine the classification report by running the following chunk of code:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.99      0.85      0.91      5496
     Deposit       0.43      0.90      0.58       682

    accuracy                           0.86      6178
   macro avg       0.71      0.88      0.75      6178
weighted avg       0.92      0.86      0.88      6178

Although accuracy is not our metric of interest, we can check the model's accuracy score either in the table above or by running the following chunk of code:

print(clf_logistic.score(X_eval, y_eval).round(2))
0.86

This means the model correctly predicts 86% of the cases. Finally, we check the confusion matrix, a table with the 4 combinations of predicted and actual values.

Where:

TN = True Negatives

TP = True Positives

FN = False Negatives

FP = False positives
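Note that scikit-learn's confusion_matrix arranges these counts as [[TN, FP], [FN, TP]]; a minimal sketch with made-up labels shows how to read them off:

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

# scikit-learn's layout: row = actual class, column = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 1
```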

# Print the confusion matrix
matrix = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix)
[[4692  804]
 [  68  614]]

TN = 4692

TP = 614

FN = 68

FP = 804

We are interested in evaluating our model based on the recall (aka true positive rate) for subscribing to a term deposit, which can also be seen in the classification report:

recall_log_reg_1 = round(matrix[1][1]/(matrix[1][1]+matrix[1][0]), 2)
print(recall_log_reg_1)
0.9

We also want to improve this metric while maintaining accuracy. As seen before, the cut-off point for assigning categories to the predictions was 0.5. The cut-off point determines whether a customer with certain characteristics is predicted to subscribe to a term deposit: if the probability exceeds the cut-off, the customer is classified as "Deposit," otherwise as "No-deposit."

We can, however, set an optimal threshold for the classification and improve our recall metric. Before proceeding we must reset the preds_df dataframe with the original predicted probabilities, overwriting the classes that resulted from the previous arbitrary cut-off.

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)

First we run a for loop that evaluates the model’s performance with different probability cut-offs points, from 0 to 1 by increments of 0.001.

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.036086    1      1      1  ...      0      0      0      0
1             0.039713    1      1      1  ...      0      0      0      0
2             0.794001    1      1      1  ...      0      0      0      0
3             0.054322    1      1      1  ...      0      0      0      0
4             0.012686    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

Then we calculate the metrics: accuracy, and the recalls for deposit and no-deposit, for the various probability cut-offs.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.110392          1.0        0.000000
0.002  0.002  0.110392          1.0        0.000000
0.003  0.003  0.110554          1.0        0.000182
0.004  0.004  0.111525          1.0        0.001274

Now we are able to choose the best cut-off based on the trade-off between deposit recall and no-deposit recall.

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_fill_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

The optimal cut-off point is the following:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.555

Now we can implement the optimal threshold to the model. Again, we calculate the probability predictions from the model, then we create a dataframe with such predictions.

preds = clf_logistic.predict_proba(X_eval)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

Then we reassign the probability of accepting a deposit based on the optimal threshold.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We can roughly compare the estimates with the actual values through the difference in deposit counts.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4876
1    1302
Name: prob_accept_deposit, dtype: int64

Count of actual deposits in our validation set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

For further information it is necessary to check the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.87      0.92      5496
     Deposit       0.46      0.87      0.60       682

    accuracy                           0.87      6178
   macro avg       0.72      0.87      0.76      6178
weighted avg       0.92      0.87      0.89      6178

By setting this new cut-off our recall metric is balanced in both classes and the model improves the correctness of classification in each of the classes.

We check the confusion matrix and compare it with the previous one.

# Print the confusion matrix
matrix_2 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_2)
[[4788  708]
 [  88  594]]

TN = 4788

TP = 594

FN = 88

FP = 708

Now we check the accuracy after assigning the values with the new cut-off point.

accuracy_log_reg_1 = round((matrix_2[0][0]+matrix_2[1][1])/sum(sum(matrix_2)), 3)
print(accuracy_log_reg_1)
0.871

There is a slight improvement in the accuracy (from 0.86 to 0.871).

We are interested in evaluating our model based on the recall (aka true positive rate) for subscribing to a term deposit:

recall_deposit_log_reg_1 = round(matrix_2[1][1]/(matrix_2[1][1]+matrix_2[1][0]), 2)
print(recall_deposit_log_reg_1)
0.87

Now we calculate the Area Under Curve (AUC) score, which stands for "Area under the ROC Curve." The AUC measures the entire two-dimensional area underneath the ROC curve and allows classifiers to be compared by their total area under the curve. AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, and one whose predictions are 100% correct has an AUC of 1.0.

The Receiver Operating Characteristic (ROC) curve is a two-dimensional graph that depicts the trade-off between benefits (true positives) and costs (false positives), displaying the relation between sensitivity and specificity for a given classifier. The TPR (True Positive Rate) is plotted on the Y axis and the FPR (False Positive Rate) on the X axis, where the TPR is the proportion of true positives relative to the sum of true positives and false negatives, and the FPR is the proportion of false positives relative to the sum of false positives and true negatives. A ROC curve examines a single classifier over a range of classification thresholds.
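As a small sketch (toy labels and scores, not our model's output), scikit-learn's roc_curve returns the FPR/TPR pairs that trace this curve, and roc_auc_score the area beneath it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# FPR and TPR at each candidate threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(auc, 2))  # 0.75
```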

prob_deposit_log_reg_1 = preds[:, 1]
auc_log_reg_1 = round(roc_auc_score(y_eval, prob_deposit_log_reg_1), 3)
print(auc_log_reg_1)
0.939

Regularized Logistic Regression Model

In this section we use the same model (logistic regression), but this time with regularization. Regularization techniques limit the capacity of models such as logistic regression by adding a parameter norm penalty, weighted by \(\lambda\), to the objective function:

\(\tilde{J}(\theta) = J(\theta) + \lambda\Omega(\theta)\)

where for the L2 (ridge) penalty \(\Omega(\theta) = \frac{1}{2}\lVert w \rVert_2^2\).

Generally we trade off some bias for lower variance, and lower-variance estimators tend to overfit less. Ridge regression (the L2-norm penalty) encodes an assumption about the function we are fitting, namely that its weights, and hence its gradient, are small. In general, when we trade bias for lower variance, it is because we are biasing the fit towards the kind of functions we want.
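In scikit-learn the penalty strength is controlled through the C parameter, the inverse of \(\lambda\): a smaller C means stronger regularization and smaller coefficients. A minimal sketch on made-up data (not the bank dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical, linearly separable toy data for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# C is the INVERSE of the regularization strength: small C = strong L2 penalty
strong = LogisticRegression(penalty="l2", C=0.01, max_iter=10000).fit(X, y)
weak = LogisticRegression(penalty="l2", C=100.0, max_iter=10000).fit(X, y)

# the strongly regularized model is pulled towards smaller coefficients
print(abs(strong.coef_[0][0]) < abs(weak.coef_[0][0]))  # True
```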

Our regularized logistic regression uses the Stochastic Average Gradient (SAG) optimisation algorithm, and we set max_iter = 10000 (a large number) to allow the estimates to converge.

clf_logistic2 = LogisticRegression(solver='sag', max_iter = 10000, penalty = 'l2').fit(X_train, np.ravel(y_train))

As in the above section we make predictions using the evaluation dataset.

preds = clf_logistic2.predict_proba(X_eval)

These predictions are stored in a dataframe instead of an array.

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

Following the same approach for selecting the best cut-off point, we run the same algorithm to find the optimal probability cut-off and balance our recall metric. Again we classify the probabilities using cut-off points from 0 to 1 in increments of 0.001.

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.066219    1      1      1  ...      0      0      0      0
1             0.055719    1      1      1  ...      0      0      0      0
2             0.809286    1      1      1  ...      0      0      0      0
3             0.062395    1      1      1  ...      0      0      0      0
4             0.020000    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]
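As an aside, the 1,000-column sweep above can also be built in one step with NumPy broadcasting instead of a Python loop; a minimal sketch with made-up probabilities:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities; compare each against every threshold
# at once via broadcasting rather than looping column by column.
probs = np.array([0.07, 0.81, 0.02, 0.55])
thresholds = np.arange(0, 1, 0.001)                    # 0.000 ... 0.999
labels = (probs[:, None] > thresholds[None, :]).astype(int)
sweep = pd.DataFrame(labels, columns=thresholds)
print(sweep.shape)  # (4, 1000)
```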

Then we calculate the metrics: accuracy, and the recalls for deposit and no-deposit, for the various probability cut-offs.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.110392          1.0        0.000000
0.002  0.002  0.110392          1.0        0.000000
0.003  0.003  0.110554          1.0        0.000182
0.004  0.004  0.110554          1.0        0.000182

Now we are able to choose the best cut-off based on the trade-off between deposit recall and no-deposit recall.

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_color_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall")) + scale_linetype_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

The optimal cut-off point is the following:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.525
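The same crossing point can be selected more idiomatically with pandas' `idxmin`; a small sketch with toy recall values (not the real cutoff_df):

```python
import pandas as pd

# Toy recall curves: the best threshold is where the two recalls are closest.
df = pd.DataFrame({'prob': [0.1, 0.5, 0.9],
                   'def_recalls': [1.0, 0.8, 0.2],
                   'nondef_recalls': [0.0, 0.7, 1.0]})
best = df.loc[(df['def_recalls'] - df['nondef_recalls']).abs().idxmin(), 'prob']
print(best)  # 0.5
```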

Then we reassign the probability of accepting a deposit based on the optimal threshold.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We can briefly compare how the estimates differ from the actual values by looking at the deposit counts.

Count of estimated deposits by the logistic model:

print(preds_df['prob_accept_deposit'].value_counts())
0    4818
1    1360
Name: prob_accept_deposit, dtype: int64

Count of real deposits in our test set:

print(true_df['deposit'].value_counts())
0    5496
1     682
Name: deposit, dtype: int64

For more detail, we check the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.86      0.92      5496
     Deposit       0.43      0.86      0.57       682

    accuracy                           0.86      6178
   macro avg       0.71      0.86      0.74      6178
weighted avg       0.92      0.86      0.88      6178

We can now see a more balanced recall metric; however, the values are lower than in the previous model.

Next, we check the confusion matrix:

# Print the confusion matrix
matrix_3 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_3)
[[4722  774]
 [  96  586]]

Now we check the accuracy of the model after assigning the optimal probability cut-off point:

accuracy_log_reg_2 = round((matrix_3[0][0]+matrix_3[1][1])/sum(sum(matrix_3)), 3)
print(accuracy_log_reg_2)
0.859

The accuracy of this model is lower than the previous one.

We are interested in evaluating our model based on recall (the true positive rate) for subscribing to a term deposit:

recall_deposit_log_reg_2 = round(matrix_3[1][1]/(matrix_3[1][1]+matrix_3[1][0]), 3)
print(recall_deposit_log_reg_2)
0.859

This recall is about one percentage point lower than that of the first model.
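As a sanity check, the manual confusion-matrix arithmetic used throughout is equivalent to sklearn's built-in metrics; a toy example with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])
cm = confusion_matrix(y_true, y_pred)
# Manual computation, as in the chunks above...
acc_manual = (cm[0][0] + cm[1][1]) / cm.sum()
rec_manual = cm[1][1] / (cm[1][1] + cm[1][0])
# ...matches the built-in helpers.
print(acc_manual == accuracy_score(y_true, y_pred))   # True
print(rec_manual == recall_score(y_true, y_pred))     # True
```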

#AUC
prob_deposit_log_reg_2 = preds[:, 1]
auc_log_reg_2 = round(roc_auc_score(y_eval, prob_deposit_log_reg_2), 3)
print(auc_log_reg_2)
0.927
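For intuition, AUC can be read as the probability that a randomly chosen subscriber receives a higher predicted probability than a randomly chosen non-subscriber; a toy example with four observations:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
# 3 of the 4 (positive, negative) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, scores))  # 0.75
```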

So far our first model yields better results in terms of accuracy, recall and AUC. However, because this model is penalised, its coefficient estimates are shrunk; we can therefore expect the regularized model to carry less risk of overfitting, even though its evaluation metrics are slightly lower.

Reduced Logistic Regression Model

Our dataset has 63 independent variables, and many of them do not affect the target variable; such variables are often called noisy data. The presence of noisy data in a data set can significantly impair prediction, and many empirical studies have shown that noise dramatically decreases classification accuracy and leads to poor prediction results (Gupta and Gupta 2019).

In order to eliminate the noisy data in our training dataset, we will use the Recursive Feature Elimination (RFE) method which is a feature selection approach. It works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.
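A minimal sketch of how RFE behaves, on synthetic data (10 features, of which 3 are informative; the parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE repeatedly drops the weakest feature until the requested number remains.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print(rfe.support_.sum())  # 3 features kept
print(rfe.ranking_)        # rank 1 = kept; larger ranks were eliminated earlier
```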

Our goal is to reduce the feature set to roughly a third of its original size, keeping the 20 most relevant variables.

logreg = LogisticRegression(max_iter = 10000)
rfe = RFE(logreg, n_features_to_select = 20)
rfe = rfe.fit(X_train, y_train)
print(list(zip(X_train.columns, rfe.support_, rfe.ranking_)))
[('age', False, 42), ('duration', False, 33), ('campaign', False, 21), ('pdays', False, 16), ('previous', True, 1), ('emp.var.rate', True, 1), ('cons.price.idx', False, 25), ('cons.conf.idx', False, 41), ('euribor3m', True, 1), ('nr.employed', False, 27), ('job_admin.', False, 24), ('job_blue-collar', False, 6), ('job_entrepreneur', False, 32), ('job_housemaid', False, 7), ('job_management', False, 5), ('job_retired', True, 1), ('job_self-employed', False, 3), ('job_services', False, 4), ('job_student', True, 1), ('job_technician', False, 30), ('job_unemployed', False, 17), ('job_unknown', True, 1), ('marital_divorced', False, 19), ('marital_married', False, 35), ('marital_single', False, 22), ('marital_unknown', False, 38), ('education_basic.4y', True, 1), ('education_basic.6y', False, 14), ('education_basic.9y', False, 20), ('education_high.school', False, 26), ('education_illiterate', True, 1), ('education_professional.course', False, 28), ('education_university.degree', False, 10), ('education_unknown', False, 12), ('default_no', True, 1), ('default_unknown', False, 9), ('default_yes', False, 44), ('housing_no', False, 18), ('housing_unknown', True, 1), ('housing_yes', False, 29), ('loan_no', False, 31), ('loan_unknown', False, 2), ('loan_yes', False, 39), ('contact_cellular', False, 13), ('contact_telephone', False, 23), ('month_apr', False, 43), ('month_aug', False, 15), ('month_dec', True, 1), ('month_jul', True, 1), ('month_jun', False, 40), ('month_mar', True, 1), ('month_may', True, 1), ('month_nov', True, 1), ('month_oct', True, 1), ('month_sep', True, 1), ('day_of_week_fri', False, 36), ('day_of_week_mon', True, 1), ('day_of_week_thu', False, 34), ('day_of_week_tue', False, 37), ('day_of_week_wed', False, 11), ('poutcome_failure', True, 1), ('poutcome_nonexistent', False, 8), ('poutcome_success', True, 1)]

The variables showing True are the ones we are interested in. We select them by running the following chunk of code:

col = X_train.columns[rfe.support_]
X_train_reduced = X_train[col]
X_eval_reduced = X_eval[col]

Now we can train our model on the training data with the 20 selected variables.

clf_logistic3 = LogisticRegression(max_iter = 100000).fit(X_train_reduced, np.ravel(y_train))

As with model 1, we make predictions and look for the optimal cut-off point for classification.

preds = clf_logistic3.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

We try to classify the probabilities using different cut-off points from 0 to 1 by increments of 0.001

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.269626    1      1      1  ...      0      0      0      0
1             0.287275    1      1      1  ...      0      0      0      0
2             0.921524    1      1      1  ...      0      0      0      0
3             0.381851    1      1      1  ...      0      0      0      0
4             0.287227    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

We create one confusion matrix per cut-off point and calculate the accuracy and recalls for each.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0             0.0
0.001  0.001  0.110392          1.0             0.0
0.002  0.002  0.110392          1.0             0.0
0.003  0.003  0.110392          1.0             0.0
0.004  0.004  0.110392          1.0             0.0

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_color_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall")) + scale_linetype_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

With this information we are able to choose the best cut-off point:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.403

We can proceed to classify our predictions with the best cut-off point:

preds = clf_logistic3.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

We implement the cut-off point in the discriminatory process

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

After that, we can see the results from the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.96      0.73      0.83      5496
     Deposit       0.26      0.74      0.38       682

    accuracy                           0.73      6178
   macro avg       0.61      0.74      0.61      6178
weighted avg       0.88      0.73      0.78      6178

This model is less robust: by reducing the inputs to the 20 most important variables, the accuracy dropped by almost 10 percentage points. The model is clearly sensitive to changes in its inputs, which will be taken into account in later model implementations.

We analyse the confusion matrix of this model:

matrix_4 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_4)
[[4032 1464]
 [ 179  503]]

Indeed there are more false negatives than in the previous models.

We calculate the accuracy of the model:

accuracy_log_reg_3 = round((matrix_4[0][0]+matrix_4[1][1])/sum(sum(matrix_4)), 3)
print(accuracy_log_reg_3)
0.734

Now we proceed to calculate the recall for deposits

recall_deposit_log_reg_3 = round(matrix_4[1][1]/(matrix_4[1][1]+matrix_4[1][0]), 2)
print(recall_deposit_log_reg_3)
0.74

Finally, we calculate the AUC for this model:

prob_deposit_log_reg_3 = preds[:, 1]
auc_log_reg_3 = round(roc_auc_score(y_eval, prob_deposit_log_reg_3), 3)
print(auc_log_reg_3)
0.793

After going through different implementations of the logistic regression model, we can compare all of them and choose the best in terms of Recall and AUC. The logistic regression models’ results are depicted in the following table:

data = {'Model': ['Logistic Regression Model 1', 'Regularized Logistic Regression Model', 'Reduced Logistic Regression Model'], 
        'Accuracy': [accuracy_log_reg_1, accuracy_log_reg_2, accuracy_log_reg_3],
        'Recall': [recall_deposit_log_reg_1, recall_deposit_log_reg_2, recall_deposit_log_reg_3],
        'AUC': [auc_log_reg_1, auc_log_reg_2, auc_log_reg_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison.sort_values(["Recall", "AUC"], ascending = False))
                                   Model  Accuracy  Recall    AUC
0            Logistic Regression Model 1     0.871   0.870  0.939
1  Regularized Logistic Regression Model     0.859   0.859  0.927
2      Reduced Logistic Regression Model     0.734   0.740  0.793

Gradient Boosting Trees Model

In this section we will be using the Gradient Boosting algorithm for our classification problem. Gradient boosting is an ensemble boosting technique. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model so as to minimize that error.

Gradient boosting involves three elements:

  1. A loss function to be optimized. In our case, as it is a classification problem, it will optimise the logarithmic loss.

  2. A weak learner to make predictions. For this the algorithm will use decision trees as weak learners.

  3. An additive model to add weak learners to minimize the loss function. It means that trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees.

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. For this reason we will use some hyperparameter tuning to mitigate the overfitting problem (Hastie et al. 2009).
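The additive idea in point 3 can be sketched in a few lines; this toy version uses squared error on synthetic data rather than the log loss XGBoost actually optimizes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    # Each new tree fits the residuals of the current ensemble (the negative
    # gradient of squared error), then is added with a small learning rate;
    # existing trees are never changed.
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += learning_rate * tree.predict(X)

print(np.mean((y - pred) ** 2))  # small training MSE: error shrinks as trees accumulate
```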

First we will train a gradient boosting model:

clf_gbt = xgb.XGBClassifier(use_label_encoder=False).fit(X_train, np.ravel(y_train))
[15:44:17] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Based on the trained model, we predict the probability that a customer subscribes to a term deposit, using the evaluation data.

preds = clf_gbt.predict_proba(X_eval)
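For clarity, `predict_proba` returns one column per class, ordered as in the classifier's `classes_` attribute; the columns sum to 1, and column index 1 is the probability of the positive class (subscribing). A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
p = clf.predict_proba(X[:5])
print(p.shape)                          # (5, 2): one column per class
print(np.allclose(p.sum(axis=1), 1.0))  # True: each row is a probability distribution
```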

As with the previous models we store the predictions in a dataframe.

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.000669
1        0             0.002987
2        0             0.994608
3        0             0.000813
4        0             0.000231
5        0             0.665680
6        0             0.001185
7        0             0.000039
8        0             0.001447
9        1             0.050187

We used the predict_proba function so that we can look for the optimal cut-off point at which recall and accuracy are balanced.

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.000669    1      0      0  ...      0      0      0      0
1             0.002987    1      1      1  ...      0      0      0      0
2             0.994608    1      1      1  ...      0      0      0      0
3             0.000813    1      0      0  ...      0      0      0      0
4             0.000231    1      0      0  ...      0      0      0      0

[5 rows x 1001 columns]

Now we are able to calculate the accuracy and recall for different cut-off points.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.384429          1.0        0.308042
0.002  0.002  0.462124          1.0        0.395378
0.003  0.003  0.497572          1.0        0.435226
0.004  0.004  0.519586          1.0        0.459971

A graphical representation of the optimal cut-off point is the following:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_color_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall")) + scale_linetype_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

With this information we are able to calculate the best cut-off point as follows:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.657

With the optimal cut-off point we can recode our predicted probabilities:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

After that, we can see the results from the classification report:

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.48      0.88      0.62       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.78      6178
weighted avg       0.93      0.88      0.90      6178

This model exhibits a clear improvement in terms of recall and accuracy. Compared with the logistic regression models, this model is the best so far.

We analyse the confusion matrix of this model:

matrix_5 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_5)
[[4853  643]
 [  80  602]]

We calculate the accuracy of the model:

accuracy_XGB_1 = round((matrix_5[0][0]+matrix_5[1][1])/sum(sum(matrix_5)), 3)
print(accuracy_XGB_1)
0.883

Now we proceed to calculate the recall for deposits

recall_XGB_1 = round(matrix_5[1][1]/(matrix_5[1][1]+matrix_5[1][0]), 2)
print(recall_XGB_1)
0.88

Finally, we calculate the AUC for this model:

prob_deposit_xgb_1 = preds[:, 1]
auc_XGB_1 = round(roc_auc_score(y_eval, prob_deposit_xgb_1), 3)
print(auc_XGB_1)
0.947

Reduced Gradient Boosting Trees Model

A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another more specific error function (Hastie, Tibshirani, and Friedman 2009).

The feature importances are then averaged across all of the decision trees within the model.

As in the previous model, we create and train the model on the training data with all 63 features.

clf_gbt2 = xgb.XGBClassifier(use_label_encoder=False).fit(X_train,np.ravel(y_train))

Now we print the variable importance from the model

var_importance = clf_gbt2.get_booster().get_score(importance_type = 'weight')
var_importance_df = pd.DataFrame(var_importance, index = [1])
print(var_importance)
{'duration': 713, 'nr.employed': 23, 'poutcome_success': 8, 'previous': 36, 'emp.var.rate': 56, 'euribor3m': 403, 'pdays': 54, 'contact_cellular': 22, 'day_of_week_mon': 30, 'cons.price.idx': 82, 'cons.conf.idx': 82, 'age': 375, 'poutcome_failure': 28, 'month_oct': 14, 'housing_no': 45, 'campaign': 149, 'day_of_week_thu': 28, 'housing_unknown': 7, 'day_of_week_tue': 34, 'default_no': 27, 'month_nov': 9, 'education_professional.course': 19, 'job_management': 7, 'housing_yes': 34, 'day_of_week_wed': 32, 'education_basic.9y': 14, 'job_self-employed': 10, 'education_university.degree': 43, 'marital_divorced': 13, 'month_jul': 6, 'education_unknown': 13, 'loan_yes': 13, 'education_basic.6y': 6, 'education_basic.4y': 15, 'month_may': 20, 'job_blue-collar': 19, 'job_services': 15, 'job_technician': 30, 'marital_single': 21, 'day_of_week_fri': 23, 'education_high.school': 25, 'job_admin.': 33, 'marital_married': 35, 'month_mar': 8, 'job_student': 9, 'job_entrepreneur': 9, 'loan_no': 28, 'month_aug': 2, 'month_apr': 10, 'job_retired': 9, 'job_unknown': 5, 'job_unemployed': 2, 'month_dec': 7, 'month_jun': 6, 'job_housemaid': 4, 'month_sep': 1}

Visualisation of the most important variables

var_importance <- py$var_importance_df
var_importance <- as.data.frame(t(var_importance))
names(var_importance)[1] <- "importance"
var_importance <- tibble::rownames_to_column(var_importance, "variables")
# make importances relative to max importance
var_importance <- var_importance[order(-var_importance$importance), ]
var_importance$importance <- 100 * var_importance$importance/max(var_importance$importance)


fig <- plotly::plot_ly(data = var_importance, x = ~importance, y = ~reorder(variables, 
    importance), name = "Variable Importance", type = "bar", orientation = "h") %>%
    plotly::layout(barmode = "stack", hovermode = "compare", yaxis = list(title = "Variable"), 
        xaxis = list(title = "Variable Importance"))

fig

To reduce the inputs of our trained model, we filter the X_train dataset down to the best variables, keeping only those with a relative importance of at least 10%.

var_importance_df = r.var_importance
col_names = var_importance_df.variables[var_importance_df["importance"] >= 10]

X_train_reduced = X_train[col_names]
X_eval_reduced = X_eval[col_names]

Now we train a new model on the reduced training data.

clf_gbt2 = xgb.XGBClassifier(use_label_encoder=False).fit(X_train_reduced,np.ravel(y_train))

As in the previous models we look for the optimal cut-off point.

preds = clf_gbt2.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval

We make predictions on different cut-off points from 0 to 1 by increments of 0.001

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.000756    1      0      0  ...      0      0      0      0
1             0.003237    1      1      1  ...      0      0      0      0
2             0.998340    1      1      1  ...      1      1      1      0
3             0.000782    1      0      0  ...      0      0      0      0
4             0.000104    1      0      0  ...      0      0      0      0

[5 rows x 1001 columns]

We calculate the accuracy and recalls for each of the cut-off points.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0        0.000000
0.001  0.001  0.304306          1.0        0.217977
0.002  0.002  0.400453          1.0        0.326055
0.003  0.003  0.452736          1.0        0.384825
0.004  0.004  0.489964          1.0        0.426674

Graphically, the trade-off between accuracy and the recalls, from which the best cut-off point is chosen, looks as follows:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_color_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall")) + scale_linetype_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

We are able to calculate the best cut-off point

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.648

Now we set the optimal cut-off point for the discriminatory process.

preds = clf_gbt2.predict_proba(X_eval_reduced)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval

After that, we can see the results from the classification report:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)
target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.48      0.88      0.62       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.78      6178
weighted avg       0.93      0.88      0.90      6178

We analyse the confusion matrix of this model:

matrix_6 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_6)
[[4848  648]
 [  81  601]]

It clearly has fewer false negatives than the logistic regressions; however, it has slightly more false negatives and false positives than the previous gradient boosting model.

We calculate the accuracy of the model:

accuracy_XGB_2 = round((matrix_6[0][0]+matrix_6[1][1])/sum(sum(matrix_6)), 3)
print(accuracy_XGB_2)
0.882

We proceed to calculate the recall

recall_XGB_2 = round(matrix_6[1][1]/(matrix_6[1][1]+matrix_6[1][0]), 2)
print(recall_XGB_2)
0.88

Lastly, we get the AUC.

prob_deposit_xgb_2 = preds[:, 1]
auc_XGB_2 = round(roc_auc_score(y_eval, prob_deposit_xgb_2), 3)
print(auc_XGB_2)
0.943

Cross Validated Gradient Boosting Trees Model

For this model we will use the K-fold cross validation technique, which works as follows:

1. Shuffle the dataset and split it into k groups

2. For each group, take that group as a holdout or test data set

3. Take the remaining groups as a training data set

4. Fit a model on the training set and evaluate it on the test set

5. Retain the evaluation score and discard the model

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. A common practice in machine learning projects and in the literature is to set k to 10.
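The steps above can be written out explicitly with sklearn's `KFold`; a sketch on synthetic data (`cross_val_score`, used shortly, performs the same loop internally):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Fit a fresh model on the k-1 training folds, score it on the holdout
    # fold, retain the score, and discard the model.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(len(scores))  # 10 evaluation scores, one per fold
```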

Now we proceed to create a gradient boosted tree model using 3 hyperparameters:

  • Learning rate: a simple multiplier between 0 and 1 that controls how quickly the error from each tree is corrected by the next.

  • Number of trees (n_estimators): the number of trees used in the ensemble.

  • max_depth: the maximum depth of each tree. The most commonly used values are 3 to 10.

clf_gbt3 = xgb.XGBClassifier(learning_rate = 0.01, max_depth = 7, n_estimators = 300)

Calculate the cross validation scores for 10 folds

cv_scores = cross_val_score(clf_gbt3, X_train, np.ravel(y_train), cv = 10)
print(cv_scores)

Print the average accuracy and standard deviation of the scores

print("Average accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(),
                                              cv_scores.std() * 2))
Average accuracy: 0.89 (+/- 0.02)

We look again for the optimal cut-off point

preds = cross_val_predict(clf_gbt3, X_eval, np.ravel(y_eval), cv=10, method = 'predict_proba')
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval
numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.025761    1      1      1  ...      0      0      0      0
1             0.025847    1      1      1  ...      0      0      0      0
2             0.633696    1      1      1  ...      0      0      0      0
3             0.025847    1      1      1  ...      0      0      0      0
4             0.025761    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

We calculate recall for each of the cut-off points.

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0             0.0
0.001  0.001  0.110392          1.0             0.0
0.002  0.002  0.110392          1.0             0.0
0.003  0.003  0.110392          1.0             0.0
0.004  0.004  0.110392          1.0             0.0

Graphical representation of the trade-off between recall and accuracy:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_fill_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

We calculate the best cut-off point:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.172

Predict with a model and store the predictions in a dataframe:

preds = cross_val_predict(clf_gbt3, X_eval, np.ravel(y_eval), cv=10, method = 'predict_proba')
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit']).reset_index(drop = True)
true_df = y_eval

We recode the probabilities as per the cut-off point and analyse the classification report:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)
target_names = ['No-deposit', 'Deposit']
print(classification_report(true_df, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.47      0.88      0.61       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.77      6178
weighted avg       0.93      0.88      0.89      6178

We can see that the accuracy and balanced recall do not differ much from those of the previous XGBoost models.

Now we analyse the confusion matrix.

matrix_7 = confusion_matrix(true_df,preds_df['prob_accept_deposit'])
print(matrix_7)
[[4819  677]
 [  84  598]]

Indeed there are fewer false negatives, but slightly more false positives.

We calculate the accuracy by running the following chunk of code:

accuracy_XGB_3 = round((matrix_7[0][0]+matrix_7[1][1])/sum(sum(matrix_7)), 3)
print(accuracy_XGB_3)
0.877

We proceed to calculate the recall for deposits:

recall_XGB_3 = round(matrix_7[1][1]/(matrix_7[1][1]+matrix_7[1][0]), 2)
print(recall_XGB_3)
0.88

Lastly, we calculate the AUC.

prob_deposit_xgb_3 = preds[:, 1]
auc_XGB_3 = round(roc_auc_score(y_eval, prob_deposit_xgb_3), 3)
print(auc_XGB_3)
0.939

After going through different implementations of the XGBoost model, we can compare all of them and choose the best in terms of Recall and AUC. The XGBoost models’ results are depicted in the following table:

data = {'Model': ['Gradient Boosting Trees Model 1', 'Reduced Gradient Boosting Trees Model', 'Cross Validated Gradient Boosting Trees Model'], 
        'Accuracy': [accuracy_XGB_1, accuracy_XGB_2, accuracy_XGB_3],
        'Recall': [recall_XGB_1, recall_XGB_2, recall_XGB_3],
        'AUC': [auc_XGB_1, auc_XGB_2, auc_XGB_3]
        } 
comparison = pd.DataFrame(data) 
print(comparison.sort_values(["Recall", "AUC"], ascending = False))
                                           Model  Accuracy  Recall    AUC
0                Gradient Boosting Trees Model 1     0.883    0.88  0.947
1          Reduced Gradient Boosting Trees Model     0.882    0.88  0.943
2  Cross Validated Gradient Boosting Trees Model     0.877    0.88  0.939

Random Forest

A random forest model builds multiple decision trees and merges them together to get a more accurate and stable prediction. The “forest” it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of bagging is that a combination of learning models improves the overall result.
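The bagging idea can be illustrated on synthetic data with sklearn's BaggingClassifier (this is a sketch of the method, not part of the report's pipeline; the data and names below are illustrative):

```python
# Illustrative sketch of the bagging idea behind a random forest:
# many trees are fit on bootstrap resamples and their votes are combined.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (stand-in for the bank dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A single decision tree versus a bagged ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                        n_estimators=100, random_state=42).fit(X_tr, y_tr)

print(round(tree.score(X_te, y_te), 3))
print(round(bag.score(X_te, y_te), 3))
```

On most draws the bagged ensemble's test accuracy matches or beats the single tree's, because averaging many high-variance trees reduces the variance of the combined prediction.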

This first random forest model will be run without any hyperparameter tuning; the algorithm's default settings will be used. First we train a random forest model:

random_forest = RandomForestClassifier().fit(X_train, np.ravel(y_train))

Then we predict with the model

preds = random_forest.predict_proba(X_eval)

Create dataframes with predictions

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0                 0.04
1        0                 0.03
2        0                 0.99
3        0                 0.04
4        0                 0.11
5        0                 0.42
6        0                 0.04
7        0                 0.06
8        0                 0.08
9        1                 0.21

Now we look for the optimal cut-off for the discriminatory process:

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0                 0.04    1      1      1  ...      0      0      0      0
1                 0.03    1      1      1  ...      0      0      0      0
2                 0.99    1      1      1  ...      0      0      0      0
3                 0.04    1      1      1  ...      0      0      0      0
4                 0.11    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

We calculate the accuracy and recall for each of the cut-off points:

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.135157          1.0        0.027838
0.001  0.001  0.135157          1.0        0.027838
0.002  0.002  0.135157          1.0        0.027838
0.003  0.003  0.135157          1.0        0.027838
0.004  0.004  0.135157          1.0        0.027838

Graphical representation of the optimal cut-off:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_fill_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

We calculate the optimal cut-off by running the following chunk of code:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.59

We recode the predictions based on the cut-off point.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

Now we can check the classification report.

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.88      0.93      5496
     Deposit       0.47      0.87      0.61       682

    accuracy                           0.88      6178
   macro avg       0.73      0.88      0.77      6178
weighted avg       0.93      0.88      0.89      6178

We can analyse the confusion matrix of this model.

matrix_8 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_8)
[[4825  671]
 [  87  595]]

Now we can proceed to calculate the accuracy of the model as follows:

accuracy_random_forest = round((matrix_8[0][0]+matrix_8[1][1])/sum(sum(matrix_8)), 3)
print(accuracy_random_forest)
0.877

We calculate the recall:

recall_random_forest = round(matrix_8[1][1]/(matrix_8[1][1]+matrix_8[1][0]), 2)
print(recall_random_forest)
0.87

Lastly, we obtain the AUC by running the following chunk of code:

prob_deposit_random_forest = preds[:, 1]
auc_random_forest = round(roc_auc_score(y_eval, prob_deposit_random_forest), 3)
print(auc_random_forest)
0.943

Tuned Random Forest

This time we will set 3 hyperparameters:

  • n_estimators: the number of decision trees built in the forest (sklearn's default is 100). The appropriate value is mostly related to the size of the data: to capture the trends in larger datasets, more trees are needed.

  • min_samples_split: the minimum number of samples required to split an internal node (default is 2). The problem with such a small value is that the condition is checked even at near-terminal nodes: whenever a node holds more than 2 data points, further splitting takes place. With a more lenient value such as 6, splitting stops earlier and the decision tree won't overfit the data as easily.

  • min_samples_leaf: the minimum number of samples required to be at a leaf node.
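A quick sketch (on synthetic, noisy data, not the bank dataset) of how these constraints shrink the gap between train and test accuracy:

```python
# Sketch: stricter min_samples_split / min_samples_leaf restrain tree
# growth, which narrows the train/test accuracy gap on noisy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y=0.1 injects 10% label noise so the default forest can overfit it
X, y = make_classification(n_samples=2000, n_features=15,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

default_rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
constrained_rf = RandomForestClassifier(n_estimators=350,
                                        min_samples_split=70,
                                        min_samples_leaf=7,
                                        random_state=0).fit(X_tr, y_tr)

for name, m in [("default", default_rf), ("constrained", constrained_rf)]:
    gap = m.score(X_tr, y_tr) - m.score(X_te, y_te)
    print(name, round(gap, 3))
```

The default forest memorises part of the label noise (near-perfect train accuracy), while the constrained forest gives up some train accuracy for a smaller train/test gap.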

random_forest_2 = RandomForestClassifier(n_estimators=350,min_samples_split=70,min_samples_leaf=7).fit(X_train, np.ravel(y_train))

Then we predict with the model

preds = random_forest_2.predict_proba(X_eval)

Create dataframes with predictions

preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_eval
pred_comparison = pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1)
pred_comparison.head(10)
   deposit  prob_accept_deposit
0        0             0.096282
1        0             0.079678
2        0             0.941734
3        0             0.092387
4        0             0.096555
5        0             0.429765
6        0             0.103480
7        0             0.094218
8        0             0.120510
9        1             0.226968

Now we look for the optimal cut-off for the discriminatory process:

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
preds_df.head(5)
   prob_accept_deposit  0.0  0.001  0.002  ...  0.996  0.997  0.998  0.999
0             0.096282    1      1      1  ...      0      0      0      0
1             0.079678    1      1      1  ...      0      0      0      0
2             0.941734    1      1      1  ...      0      0      0      0
3             0.092387    1      1      1  ...      0      0      0      0
4             0.096555    1      1      1  ...      0      0      0      0

[5 rows x 1001 columns]

We calculate the accuracy and recall for each of the cut-off points:

cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.110392          1.0             0.0
0.001  0.001  0.110392          1.0             0.0
0.002  0.002  0.110392          1.0             0.0
0.003  0.003  0.110392          1.0             0.0
0.004  0.004  0.110392          1.0             0.0

Graphical representation of the optimal cut-off:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_fill_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

We calculate the optimal cut-off by running the following chunk of code:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.596

We recode the predictions based on the cut-off point.

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

Now we can check the classification report.

target_names = ['No-deposit', 'Deposit']
print(classification_report(y_eval, preds_df['prob_accept_deposit'], target_names=target_names))
              precision    recall  f1-score   support

  No-deposit       0.98      0.87      0.92      5496
     Deposit       0.46      0.87      0.60       682

    accuracy                           0.87      6178
   macro avg       0.72      0.87      0.76      6178
weighted avg       0.92      0.87      0.89      6178

We can analyse the confusion matrix of this model.

matrix_9 = confusion_matrix(y_eval,preds_df['prob_accept_deposit'])
print(matrix_9)
[[4790  706]
 [  88  594]]

Now we can proceed to calculate the accuracy of the model as follows:

accuracy_random_forest_2 = round((matrix_9[0][0]+matrix_9[1][1])/sum(sum(matrix_9)), 3)
print(accuracy_random_forest_2)
0.871

We calculate the recall:

recall_random_forest_2 = round(matrix_9[1][1]/(matrix_9[1][1]+matrix_9[1][0]), 2)
print(recall_random_forest_2)
0.87

Lastly, we obtain the AUC by running the following chunk of code:

prob_deposit_random_forest_2 = preds[:, 1]
auc_random_forest_2 = round(roc_auc_score(y_eval, prob_deposit_random_forest_2), 3)
print(auc_random_forest_2)
0.94

Model Comparison

In this section we compare the performance of the models based on the selected metric, Recall. Our goal is to choose the model with the highest Recall, so that it correctly classifies both the customers that will subscribe to a term deposit and those that will not.

Logistic Regression Models

In this section we will analyse the performance of the 3 logistic regression models used:

  1. Logistic Regression Model (simple)
  2. Regularized Logistic Regression Model
  3. Reduced Logistic Regression Model

The results are found in the following table:

                                   Model  Accuracy  Recall    AUC
0            Logistic Regression Model 1     0.871   0.870  0.939
1  Regularized Logistic Regression Model     0.859   0.859  0.927
2      Reduced Logistic Regression Model     0.734   0.740  0.793

From the previous table we can infer that the model that performed best is the simple Logistic Regression Model: in terms of our selected metric, Recall, it obtained the highest score.

We can graphically compare the ROC curves of the logistic regression models on the validation dataset.

Gradient Boosting Tree Models

In this section we will analyse the performance of the 3 Gradient Boosting Tree models used:

  1. Gradient Boosting Tree Model (simple)
  2. Reduced Gradient Boosting Trees Model
  3. Cross Validated Gradient Boosting Trees Model

The results are found in the following table:

                                           Model  Accuracy  Recall    AUC
0                Gradient Boosting Trees Model 1     0.883    0.88  0.947
1          Reduced Gradient Boosting Trees Model     0.882    0.88  0.943
2  Cross Validated Gradient Boosting Trees Model     0.877    0.88  0.939

All three models tie on Recall (0.88), so the best model is Gradient Boosting Trees Model 1, which has the highest AUC. We can compare all the ROC curves on the validation dataset:

Random Forest

In this section we will analyse the performance of the 2 Random Forest models used:

  1. Random Forest Simple Model
  2. Tuned Random Forest Model

The results are found in the following table:

                 Model  Accuracy  Recall    AUC
0        Random Forest     0.877    0.87  0.943
1  Tuned Random Forest     0.871    0.87  0.940

The best model is the simple random forest, which used the algorithm's default hyperparameters. We can visualise the ROC curves as follows:

Model Selection

The following table shows the performance of all models, sorted in descending order of Recall and AUC. It is important to note that the Area Under the Curve (AUC) was calculated on the validation set and is presented only as a reference. That being said, the selection of the model will be based solely on the chosen metric: Recall. The models being compared are the following:

  • Logistic Regression Model 1

  • Regularized Logistic Regression Model

  • Reduced Logistic Regression Model

  • Gradient Boosting Trees Model 1

  • Reduced Gradient Boosting Trees Model

  • Cross Validated Gradient Boosting Trees Model

  • Random Forest

  • Tuned Random Forest

                                           Model  Accuracy  Recall    AUC
3                Gradient Boosting Trees Model 1     0.883   0.880  0.947
4          Reduced Gradient Boosting Trees Model     0.882   0.880  0.943
5  Cross Validated Gradient Boosting Trees Model     0.877   0.880  0.939
6                                  Random Forest     0.877   0.870  0.943
7                            Tuned Random Forest     0.871   0.870  0.940
0                    Logistic Regression Model 1     0.871   0.870  0.939
1          Regularized Logistic Regression Model     0.859   0.859  0.927
2              Reduced Logistic Regression Model     0.734   0.740  0.793

The model with the best performance was the Gradient Boosting Trees Model.

Since recall is balanced across both labels (deposit and no-deposit), this model correctly identifies 88% of the clients that will subscribe to a term deposit and 88% of the clients that will not.

The AUC is 0.947 which is a high value, indicating that this is a very good model.

Now we proceed to plot the ROC curves of all models:

Model Assessment

As mentioned in the previous section, the model with the best performance in terms of the Recall metric was the Gradient Boosting Trees Model. In this section we will make predictions on the train and test sets, calculate the recall, accuracy and ROC curve for each, and compare the results.

In this section we intend to investigate whether our chosen model is overfitted. As previously stated, Gradient Boosting models tend to overfit quickly. Overfitting is a situation where a model performs very well on the training data but its performance drops significantly on the test set.

In order to detect overfitting we will calculate the metrics Recall and Accuracy, along with the AUC, for each dataset: train, validation and test. Then we will compare how much the metrics differ with respect to the training set.
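The comparison just described can be sketched as a small helper that produces one row of metrics per dataset (synthetic data and illustrative names below, not the report's actual objects):

```python
# Sketch: one row of (Accuracy, Recall, AUC) per dataset, to eyeball the
# gap between train performance and performance on unseen data.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 60/20/20 train/validation/test split on synthetic data
X, y = make_classification(n_samples=1500, random_state=1)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=1)

model = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

def metric_row(name, X_, y_, threshold=0.5):
    probs = model.predict_proba(X_)[:, 1]
    labels = (probs > threshold).astype(int)  # recode as per the cut-off
    return {"Dataset": name,
            "Accuracy": round(accuracy_score(y_, labels), 3),
            "Recall": round(recall_score(y_, labels), 3),
            "AUC": round(roc_auc_score(y_, probs), 3)}

rows = [metric_row(n, X_, y_) for n, X_, y_ in
        [("Train set", X_train, y_train),
         ("Validation set", X_val, y_val),
         ("Test set", X_test, y_test)]]
print(pd.DataFrame(rows))
```

A large drop from the train row to the validation/test rows is the overfitting signal this section looks for.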

Predictions

Since our model was already evaluated on the validation set in the previous sections, here we will calculate its performance on the test and training datasets.

Test Set:

First we make predictions using the test set:

preds = clf_gbt.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_test

We look for the optimal cut-off point

numbers  = [float(x)/1000 for x in range(1000)]
for i in numbers:
    preds_df[i]= preds_df.prob_accept_deposit.map(lambda x: 1 if x > i else 0)
    
cutoff_df = pd.DataFrame( columns = ['prob','accs','def_recalls','nondef_recalls'])
for i in numbers:
    cm1 = metrics.confusion_matrix(true_df, preds_df[i])
    total1=sum(sum(cm1))
    accs = (cm1[0][0]+cm1[1][1])/total1
    
    def_recalls = cm1[1][1]/(cm1[1][1]+cm1[1][0])
    nondef_recalls = cm1[0][0]/(cm1[0][0]+cm1[0][1])
    cutoff_df.loc[i] =[ i ,accs,def_recalls,nondef_recalls]
print(cutoff_df.head(5))
        prob      accs  def_recalls  nondef_recalls
0.000  0.000  0.115553          1.0        0.000000
0.001  0.001  0.376598          1.0        0.295151
0.002  0.002  0.454442          1.0        0.383166
0.003  0.003  0.490371          1.0        0.423788
0.004  0.004  0.513190          1.0        0.449588

Graphical representation of the best cut-off point:

cutoff_df <- py$cutoff_df

names(cutoff_df)[names(cutoff_df) == "prob"] <- "Probability cut-off"
names(cutoff_df)[names(cutoff_df) == "accs"] <- "Accuracy"
names(cutoff_df)[names(cutoff_df) == "def_recalls"] <- "Deposit Recall"
names(cutoff_df)[names(cutoff_df) == "nondef_recalls"] <- "No-deposit Recall"

cutoff_df <- cutoff_df %>%
    gather(key = "metric", value = "value", -`Probability cut-off`)

ggplot(cutoff_df, aes(x = `Probability cut-off`, y = value, color = metric)) + geom_line(aes(linetype = metric)) + 
    ggtitle("Accuracy, Deposit Recall and No-deposit Recall") + scale_fill_discrete(name = "Metrics", 
    labels = c("Accuracy", "Deposit Recall", "No-deposit Recall"))

Calculation of the best cut-off point:

cutoff_df["diff"] = abs(cutoff_df.def_recalls - cutoff_df.nondef_recalls)
best_threshold = cutoff_df["prob"].loc[cutoff_df["diff"] == min(cutoff_df["diff"])]
best_threshold = best_threshold.iloc[0]
print(best_threshold)
0.656

Recode the probabilities as per the best cut-off point:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We check the confusion matrix:

matrix_10 = confusion_matrix(y_test,preds_df['prob_accept_deposit'])
print(matrix_10)
[[4761  704]
 [  92  622]]

We calculate the accuracy for the test set.

accuracy_XGB_1_test = round((matrix_10[0][0]+matrix_10[1][1])/sum(sum(matrix_10)), 3)
print(accuracy_XGB_1_test)
0.871

Now we proceed to calculate the Recall for the test set.

recall_XGB_1_test = round(matrix_10[1][1]/(matrix_10[1][1]+matrix_10[1][0]), 2)
print(recall_XGB_1_test)
0.87

We calculate the AUC for the test set

prob_deposit_xgb_1_test = preds[:, 1]
auc_XGB_1_test = round(roc_auc_score(y_test, prob_deposit_xgb_1_test), 3)
print(auc_XGB_1_test)
0.94

Train set:

First we make predictions using the train set:

preds = clf_gbt.predict_proba(X_train)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_accept_deposit'])
true_df = y_train

We recode the probabilities as per the best cut-off point:

preds_df['prob_accept_deposit'] = preds_df['prob_accept_deposit'].apply(lambda x: 1 if x > best_threshold else 0)

We check the confusion matrix:

matrix_11 = confusion_matrix(y_train,preds_df['prob_accept_deposit'])
print(matrix_11)
[[3214   30]
 [ 134 3110]]

We calculate the accuracy for the train set.

accuracy_XGB_1_train = round((matrix_11[0][0]+matrix_11[1][1])/sum(sum(matrix_11)), 3)
print(accuracy_XGB_1_train)
0.975

Now we proceed to calculate the Recall for the train set.

recall_XGB_1_train = round(matrix_11[1][1]/(matrix_11[1][1]+matrix_11[1][0]), 2)
print(recall_XGB_1_train)
0.96

We calculate the AUC for the train set

prob_deposit_xgb_1_train = preds[:, 1]
auc_XGB_1_train = round(roc_auc_score(y_train, prob_deposit_xgb_1_train), 3)
print(auc_XGB_1_train)
0.998

Overfitting Assessment

So far we have calculated the recall, accuracy and AUC when the input data comes from the train, validation and test set. Now we can compare the ROC curves for each of the datasets.

As expected, the train dataset yields very good results, since the model was trained on that same data. However, performance differs severely when the input data is unseen (validation and test sets).

The difference between the metrics Recall, Accuracy, and AUC in the train, validation, and test set is depicted as follows:

          Dataset  Accuracy  Recall    AUC
2       Train set     0.975    0.96  0.998
0  Validation Set     0.883    0.88  0.947
1        Test Set     0.871    0.87  0.940

We can see that the model performs extremely well on the training set, as expected. However, since the predictions, and therefore the recall and accuracy, change drastically when new input data is presented, we can say that this model is overfitted. Note that a drop in Recall of 9 to 10% on unseen data is a bad indicator in this assessment.

Although this model was chosen over the others due to its good Recall, it is unfortunately prone to overfitting, which ended up being the case.

Conclusions

Our main goal was to create a model that best identifies and classifies both the customers that will subscribe to a term deposit and those that will not. For this we built 8 models, and the one that showed the best performance was the Gradient Boosting Tree model with no tuning or regularization.

As mentioned before, this model can by its nature overfit quickly, and that ended up being the case: it reported excellent results on the training data, but its metrics dropped by 9 to 10% when exposed to the validation and test sets. However, the metrics on the 2 unseen datasets (validation and test) differed from each other by only around 1%, showing some robustness in the predictions.

Although overfitted, our Gradient Boosting Tree model correctly identifies around 88% of each of the 2 categories: customers that will subscribe to a term deposit and customers that will not. Its accuracy varies by only around 1% between the unseen datasets.

Lastly, we can identify the factors that influence whether a customer subscribes to a term deposit. As per the variable analysis in the Reduced Gradient Boosting Trees Model section, duration, euribor3m, age, campaign and cons.price.idx are the key determinants of the probability that a customer subscribes to a term deposit.

One of the biggest problems encountered in this project was setting the hyperparameters correctly. A good approach is a grid search; however, the execution time and computational power required to train the model on different hyperparameter combinations, compare their performance, and cross-validate is considerable. We therefore tried different approaches in order to add variety to the estimation.

As a proposal for improving this model, the overfitting problem could be addressed by trying different hyperparameters and tuning them via a grid search. We could also try a different sampling technique, for example oversampling, apply dimensionality reduction, and try regularisation techniques on all the models, then choose the model that reports the best Recall. Alternatively, this analysis could be performed on numerical and categorical variables separately.
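The grid search proposed above could be sketched as follows (a deliberately small grid on synthetic data so it runs quickly; a realistic search over these models would be far larger):

```python
# Illustrative sketch of a grid search over random forest hyperparameters,
# scored on recall to mirror the selection criterion used in this report.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=800, random_state=3)

param_grid = {
    "n_estimators": [100, 350],
    "min_samples_split": [2, 70],
    "min_samples_leaf": [1, 7],
}
search = GridSearchCV(RandomForestClassifier(random_state=3),
                      param_grid, scoring="recall", cv=5).fit(X, y)

print(search.best_params_)           # best combination found
print(round(search.best_score_, 3))  # its cross-validated recall
```

Every combination in the grid is cross-validated, which is exactly where the computational cost mentioned above comes from: this 2×2×2 grid with 5 folds already trains 40 forests.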
